
Batch tpu calls in send-transaction-service #24083

Merged 46 commits into master on Apr 21, 2022

Conversation

lijunwangs
Contributor

@lijunwangs commented Apr 2, 2022

Problem

Send transactions in batches to improve throughput.

Summary of Changes

  1. Introduced the flag --tpu-do-batch2 to enable batching.
  2. Introduced a flag to control the batch size -- 100 by default.
  3. The batch timeout defaults to 200ms and is configurable. A batch is sent when it either times out or fills up.
  4. Batches honor the retry rate for transactions that have already been sent.
  5. Introduced two threads in STS: one for receiving new transactions and doing the batch send, and one for retrying previously sent transactions in batches.
    Fixes #
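The size-or-timeout flush described above can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's actual code: `TransactionBatcher` and its methods are hypothetical names, parameterized with the PR's defaults of batch size 100 and a 200ms timeout in mind.

```rust
use std::time::{Duration, Instant};

/// Hypothetical batcher: a batch is emitted when it reaches `batch_size`
/// or when `batch_timeout` elapses since the last flush.
struct TransactionBatcher {
    batch: Vec<Vec<u8>>,     // wire-serialized transactions
    batch_size: usize,       // 100 by default in the PR
    batch_timeout: Duration, // 200ms by default in the PR
    last_flush: Instant,
}

impl TransactionBatcher {
    fn new(batch_size: usize, batch_timeout: Duration) -> Self {
        Self {
            batch: Vec::new(),
            batch_size,
            batch_timeout,
            last_flush: Instant::now(),
        }
    }

    /// Queue a transaction; return a full batch once the size threshold is hit.
    fn push(&mut self, tx: Vec<u8>) -> Option<Vec<Vec<u8>>> {
        self.batch.push(tx);
        if self.batch.len() >= self.batch_size {
            Some(self.take())
        } else {
            None
        }
    }

    /// Called periodically; flush a partial batch once the timeout elapses.
    fn maybe_flush(&mut self) -> Option<Vec<Vec<u8>>> {
        if !self.batch.is_empty() && self.last_flush.elapsed() >= self.batch_timeout {
            Some(self.take())
        } else {
            None
        }
    }

    fn take(&mut self) -> Vec<Vec<u8>> {
        self.last_flush = Instant::now();
        std::mem::take(&mut self.batch)
    }
}

fn main() {
    let mut b = TransactionBatcher::new(2, Duration::from_millis(200));
    assert!(b.push(vec![1]).is_none()); // below threshold: buffered
    let full = b.push(vec![2]).unwrap(); // threshold reached: flushed
    assert_eq!(full.len(), 2);
    assert!(b.push(vec![3]).is_none());
    std::thread::sleep(Duration::from_millis(210));
    assert_eq!(b.maybe_flush().unwrap().len(), 1); // timeout flush
    println!("ok");
}
```

In the service itself, the timeout check would presumably run on the sender thread's periodic wakeup; here `maybe_flush` simply has to be polled.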

@codecov

codecov bot commented Apr 4, 2022

Codecov Report

Merging #24083 (697ea14) into master (255a6a7) will decrease coverage by 0.0%.
The diff coverage is 89.5%.

❗ Current head 697ea14 differs from pull request most recent head aca582a. Consider uploading reports for the commit aca582a to get more accurate results

@@            Coverage Diff            @@
##           master   #24083     +/-   ##
=========================================
- Coverage    82.0%    81.9%   -0.1%     
=========================================
  Files         593      593             
  Lines      163827   163956    +129     
=========================================
+ Hits       134340   134399     +59     
- Misses      29487    29557     +70     

Contributor

@t-nelson left a comment


I think this change set can be much cleaner if we don't special-case batching and instead just batch everything with default batch size of one. Thoughts @CriesofCarrots ?

Review threads on validator/src/main.rs and send-transaction-service/src/send_transaction_service.rs (outdated, resolved)
@lijunwangs
Contributor Author

I think this change set can be much cleaner if we don't special-case batching and instead just batch everything with default batch size of one. Thoughts @CriesofCarrots ?

The logic specifically added for batching has to be there -- unless you are questioning the logic itself. Using batch size = 1, which effectively disables batching, does not change that fact.

@t-nelson
Contributor

t-nelson commented Apr 5, 2022

I think this change set can be much cleaner if we don't special-case batching and instead just batch everything with default batch size of one. Thoughts @CriesofCarrots ?

The logic specifically added for batching has to be there -- unless you are questioning the logic itself. Using batch size = 1, which effectively disables batching, does not change that fact.

That's exactly what I'm suggesting. All RPC STS sends should use batching; the default is just size = 1. That is, we drop the old logic entirely in favor of the new batching logic. It will be much cleaner and less bug-prone than this shoehorned implementation.

@lijunwangs
Contributor Author

lijunwangs commented Apr 5, 2022

I think this change set can be much cleaner if we don't special-case batching and instead just batch everything with default batch size of one. Thoughts @CriesofCarrots ?

The logic specifically added for batching has to be there -- unless you are questioning the logic itself. Using batch size = 1, which effectively disables batching, does not change that fact.

That's exactly what I'm suggesting. All RPC STS sends should use batching; the default is just size = 1. That is, we drop the old logic entirely in favor of the new batching logic. It will be much cleaner and less bug-prone than this shoehorned implementation.

I don't think it makes sense to drop the old logic altogether in the case of batch size = 1 -- for example, firing right away without waiting for further entries. In my opinion, keeping that logic is important until we gain confidence in the batching change.
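As a standalone illustration of the fast path being argued for here (a hypothetical `dispatch` helper, not the PR's code): with batch size 1 the transaction goes out immediately instead of being buffered behind a timeout.

```rust
/// Hypothetical helper: with batch_size <= 1, emit the transaction
/// immediately (the pre-batching behavior); otherwise buffer it and
/// only emit a batch once the size threshold is reached.
fn dispatch(
    batch_size: usize,
    tx: Vec<u8>,
    pending: &mut Vec<Vec<u8>>,
) -> Option<Vec<Vec<u8>>> {
    if batch_size <= 1 {
        return Some(vec![tx]); // fire right away, no buffering delay
    }
    pending.push(tx);
    if pending.len() >= batch_size {
        Some(std::mem::take(pending)) // full batch ready to send
    } else {
        None // wait for more entries, or for a timeout flush elsewhere
    }
}

fn main() {
    let mut pending = Vec::new();
    // batch_size = 1: immediate send, nothing buffered
    assert_eq!(dispatch(1, vec![7], &mut pending), Some(vec![vec![7]]));
    assert!(pending.is_empty());
    // batch_size = 3: buffered until the threshold is met
    assert!(dispatch(3, vec![1], &mut pending).is_none());
    assert!(dispatch(3, vec![2], &mut pending).is_none());
    assert_eq!(dispatch(3, vec![3], &mut pending).unwrap().len(), 3);
    println!("ok");
}
```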

@lijunwangs
Contributor Author

I think this change set can be much cleaner if we don't special-case batching and instead just batch everything with default batch size of one. Thoughts @CriesofCarrots ?

The logic specifically added for batching has to be there -- unless you are questioning the logic itself. Using batch size = 1, which effectively disables batching, does not change that fact.

That's exactly what I'm suggesting. All RPC STS sends should use batching; the default is just size = 1. That is, we drop the old logic entirely in favor of the new batching logic. It will be much cleaner and less bug-prone than this shoehorned implementation.

I don't think it makes sense to drop the old logic altogether in the case of batch size = 1 -- for example, firing right away without waiting for further entries. In my opinion, keeping that logic is important until we gain confidence in the batching change.

Maybe we can get rid of send_transaction being called in multiple places.

Review threads on multinode-demo/bootstrap-validator.sh, validator/src/main.rs, and send-transaction-service/src/send_transaction_service.rs (mostly outdated, resolved)
@lijunwangs force-pushed the batch_tpu_calls branch 2 times, most recently from 5d03a97 to c05429c on April 9, 2022 17:50
@CriesofCarrots
Contributor

Fyi, I should have a bench-tps that targets Rpc/STS tomorrow-ish. I haven't tested rates yet, but hopefully it will be enough TPS to test the pathways you want here.
And if my review is still desired, I'll then look at this.

@lijunwangs
Contributor Author

Fyi, I should have a bench-tps that targets Rpc/STS tomorrow-ish. I haven't tested rates yet, but hopefully it will be enough TPS to test the pathways you want here.
And if my review is still desired, I'll then look at this.

Thanks @CriesofCarrots. It would be great if you could also examine this logic, as the change is relatively large in a high-traffic path.

@t-nelson
Contributor

let me know when this is ready for another pass

@lijunwangs
Contributor Author

let me know when this is ready for another pass

Thanks Trent -- the code itself can be reviewed -- it handles the case where we do async sends in non-batch mode. Right now I am setting up a cluster to validate whether batching makes a difference. Will post some results.

@lijunwangs force-pushed the batch_tpu_calls branch 2 times, most recently from 03bc1bd to d3932a2 on April 15, 2022 05:34
@lijunwangs
Contributor Author

Investigating the local-cluster test failure -- it seems to be related to the changes in this PR...

@lijunwangs
Contributor Author

r+ @CriesofCarrots approval

Hi Tyera -- will you be able to give it one more pass? Thanks @t-nelson for all the great feedback!

@CriesofCarrots
Contributor

Yep, I'm about halfway through. I have to step afk for a bit, but it will be done tonight!

Contributor

@CriesofCarrots left a comment


Okay, I think all I have are nits!
I saw a few of the things that you and Trent already talked about, so I'm looking forward to the iterations.

However, one question just to make sure I understand: the batch_send_rate_ms only affects the initial send, right? After that, batches are sent on the retry_rate_ms schedule?

@@ -61,6 +61,15 @@ while [[ -n $1 ]]; do
elif [[ $1 = --enable-rpc-bigtable-ledger-storage ]]; then
args+=("$1")
shift
elif [[ $1 = --tpu-use-quic ]]; then
Contributor


Thanks for plumbing these

Contributor Author


Correct @CriesofCarrots. This is a heuristic control to avoid extreme configurations of the two parameters. It is not a full rate-limiting implementation; that will require another PR to address.

Review threads on validator/src/main.rs and send-transaction-service/src/send_transaction_service.rs (resolved)
lijunwangs and others added 6 commits April 21, 2022 09:27
Co-authored-by: Tyera Eulberg <teulberg@gmail.com>
@mergify bot dismissed t-nelson's stale review April 21, 2022 16:29

Pull request has been modified.

Contributor

@CriesofCarrots left a comment


lgtm!

@lijunwangs merged commit 7c61e43 into solana-labs:master on Apr 21, 2022
mergify bot pushed a commit that referenced this pull request Apr 21, 2022

(cherry picked from commit 7c61e43)

# Conflicts:
#	validator/src/main.rs
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 21, 2022
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 22, 2022
lijunwangs added a commit that referenced this pull request Apr 22, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Apr 22, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 27, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 27, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Jun 28, 2022